# Lab 3:  groupby and more (Seaborn) plots

This lab explores the FBI NICS Firearms Background Check data, which records the number of background checks made.  A background check must be made prior to *some* sales of firearms (private sales are a big exception).  This data is often used as the best available approximation of total gun sales at a given time.

BuzzFeed converts the PDF data supplied by the FBI to CSV files.

For more information on the dataset: [https://github.com/BuzzFeedNews/nics-firearm-background-checks](https://github.com/BuzzFeedNews/nics-firearm-background-checks)

For a direct link to the dataset (current as of July 2019):  [https://raw.githubusercontent.com/BuzzFeedNews/nics-firearm-background-checks/master/data/nics-firearm-background-checks.csv](https://raw.githubusercontent.com/BuzzFeedNews/nics-firearm-background-checks/master/data/nics-firearm-background-checks.csv)

In [None]:
import matplotlib
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
%matplotlib inline

pd.set_option('display.max_columns', None)

Read the CSV file into a dataframe called `guns`, and display the dataframe to make sure it was loaded correctly.
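As a rough sketch of the loading step, the cell below reads a tiny in-memory CSV with `pd.read_csv`. The three rows and their values are made up for illustration; only the column names (`month`, `state`, `handgun`, `long_gun`) mirror the real file, and the name `sample` is hypothetical. Passing the dataset URL given above to `pd.read_csv` in place of the `StringIO` object should work the same way.

```python
import io
import pandas as pd

# Made-up sample rows; only the column names mirror the real file.
sample_csv = io.StringIO(
    "month,state,handgun,long_gun\n"
    "2019-06,Alabama,100,200\n"
    "2019-06,Alaska,30,40\n"
    "2019-05,Alabama,110,190\n"
)

# read_csv accepts a file path, a URL, or a file-like object.
sample = pd.read_csv(sample_csv)

# head() shows the first rows; dtypes shows how each column was parsed.
print(sample.head())
print(sample.dtypes)
```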

Make the `month` column into a `datetime` object.

There was no day in the original `month` column.  What happens to the day once we convert this column into a `datetime` object?
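To experiment with the conversion before applying it to `guns`, here is a minimal sketch on a hypothetical Series of year-month strings. It assumes the `month` column uses the `YYYY-MM` layout; check the displayed dataframe to confirm. Run it and look at the day pandas fills in.

```python
import pandas as pd

# Hypothetical year-month strings, assumed to match the layout of the `month` column.
months = pd.Series(["2019-05", "2019-06", "2019-07"])

# An explicit format avoids relying on pandas guessing the layout.
parsed = pd.to_datetime(months, format="%Y-%m")

print(parsed)         # full timestamps, including a day
print(parsed.dt.day)  # the day pandas filled in for each value
```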

To get a feel for the data, plot the number of handgun background checks (the `handgun` column) made in New York on the y axis and the date on the x axis.

What do you notice about the plot?

What was the mean number of handgun background checks? 
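One way to answer a question like this for a single state is to filter the rows first and then take the mean of one column. The sketch below does that on a made-up dataframe called `toy` (a hypothetical name with invented numbers); the same pattern should carry over to `guns`.

```python
import pandas as pd

# Made-up numbers, not real NICS data.
toy = pd.DataFrame({
    "state": ["New York", "New York", "Texas", "Texas"],
    "handgun": [10, 30, 100, 300],
})

# A boolean mask keeps only the rows for one state.
ny = toy[toy["state"] == "New York"]

# Mean of a single column for the filtered rows.
ny_mean = ny["handgun"].mean()
print(ny_mean)
```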

### Groupby


What if we wanted to find the mean number of handgun checks for each state?  Our usual method of filtering would take a while.  Instead we will use the *group by* process, which:
- *splits* the data into groups based on some criteria
- *applies* a function to each group independently
- *combines* the results into a data structure

The splitting step is done by the function `groupby()` and a second function, like `mean()`, is applied to the groups.
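To make the three steps concrete, the sketch below spells them out by hand on a made-up dataframe and then compares the result with the one-line `groupby` version. The dataframe `toy` and its numbers are invented for illustration.

```python
import pandas as pd

# Made-up numbers, not real NICS data.
toy = pd.DataFrame({
    "state": ["Ohio", "Ohio", "Utah", "Utah", "Utah"],
    "handgun": [10, 20, 1, 2, 3],
})

# Split: one sub-frame per state. Apply: mean of each. Combine: back into a Series.
by_hand = pd.Series(
    {state: group["handgun"].mean() for state, group in toy.groupby("state")}
)

# The same three steps in a single expression.
with_groupby = toy.groupby("state")["handgun"].mean()

print(by_hand)
print(with_groupby)
```

Both Series should hold one mean per state, which is the point of the *combine* step.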

In [None]:
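# numeric_only=True skips non-numeric columns, which recent pandas versions no longer drop silently
guns.groupby("state").mean(numeric_only=True)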

If we only want to see the `handgun` column, we can use:


In [None]:
# Selecting the column before mean() avoids averaging every other column first
guns.groupby("state")["handgun"].mean()

Other functions we can use with `groupby()` are:
- `mean()` : Compute mean of groups
- `sum()` : Compute sum of group values
- `size()` : Compute group sizes
- `count()` : Compute count of non-missing values in each group
- `std()` : Standard deviation of groups
- `var()` : Compute variance of groups
- `describe()` : Generates descriptive statistics
- `min()` : Compute min of group values
- `max()` : Compute max of group values
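As a small illustration of how `size()` and `count()` differ, the sketch below uses a made-up dataframe with one missing value. It also uses `agg()`, a pandas method that is not in the list above, to apply several of these functions in one call.

```python
import numpy as np
import pandas as pd

# Made-up numbers with one missing value, not real NICS data.
toy = pd.DataFrame({
    "state": ["Ohio", "Ohio", "Utah", "Utah"],
    "long_gun": [5.0, np.nan, 7.0, 9.0],
})

grouped = toy.groupby("state")["long_gun"]

sizes = grouped.size()    # rows per group, missing values included
counts = grouped.count()  # non-missing values per group
summary = grouped.agg(["mean", "min", "max"])  # several functions at once

print(sizes)
print(counts)
print(summary)
```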

For example, what is the standard deviation of long gun background checks for each state?

Notice that the per-state mean of the `handgun` column computed above is a Series indexed by state, much like the output of `value_counts()`.  We can use it to make a bar plot.  Try it below.

In [None]:
guns.groupby("state")["handgun"].mean().plot.bar()

<details> <summary>Answer:</summary>
guns.groupby("state")["handgun"].mean().plot.bar()
</details>

We can also use `groupby` for dates.  For example, to sum by month:

In [None]:
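# numeric_only=True leaves out the datetime and text columns, which cannot be summed meaningfully
guns.groupby(guns["month"].dt.month).sum(numeric_only=True)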

Which month has the most background checks for long guns?  For handguns?
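A minimal sketch of the same idea on made-up data: group by calendar month, sum one column, and use `idxmax()` to read off the month with the largest total. The dataframe `toy` and its numbers are invented for illustration.

```python
import pandas as pd

# Made-up numbers, not real NICS data.
toy = pd.DataFrame({
    "month": pd.to_datetime(
        ["2018-01", "2018-12", "2019-01", "2019-12"], format="%Y-%m"
    ),
    "long_gun": [10, 50, 20, 70],
})

# Group by calendar month (1-12), so both Januaries land in the same group.
monthly = toy.groupby(toy["month"].dt.month)["long_gun"].sum()

# idxmax returns the index label (here, the month number) of the largest total.
busiest = monthly.idxmax()

print(monthly)
print(busiest)
```

Note that grouping on `.dt.month` pools the same month across all years; grouping on `.dt.year` would give yearly totals instead.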

### Seaborn plotting

[Seaborn](https://seaborn.pydata.org) is a Python package for creating beautiful plots.

For example, suppose we want to make a scatter plot but use size and color to add more information to the plot.

In Pandas, make a scatter plot with number of handgun background checks on the x axis and number of long gun background checks on the y axis.

To make the same plot in Seaborn, we use the code:

In [None]:
sns.relplot(x="handgun", y="long_gun", data=guns)

To color the points by the state:

In [None]:
sns.relplot(x="handgun", y="long_gun", hue="state", data=guns)

This plot is a little hard to interpret, so let's make a smaller dataframe called `guns5` that contains only 5 states (whichever 5 you would like).
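One possible way to build such a subset is `isin()`, sketched here on a made-up dataframe with a hypothetical list of two states. The same pattern with five state names from the real data should produce the smaller dataframe.

```python
import pandas as pd

# Made-up numbers, not real NICS data.
toy = pd.DataFrame({
    "state": ["Ohio", "Utah", "Iowa", "Ohio", "Utah", "Iowa"],
    "handgun": [10, 1, 4, 20, 2, 5],
})

# Hypothetical choice of states; swap in any names that appear in the data.
chosen = ["Ohio", "Iowa"]

# isin builds a boolean mask that is True where the state is in the list.
subset = toy[toy["state"].isin(chosen)]

print(subset)
```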

To size the circles by the total number of permit checks made that month:

In [None]:
sns.relplot(x="handgun", y="long_gun", hue="state", size="permit", data=guns5)

There are some large handgun and long gun background check values.  Which state are they from?

What are the maximum values in the `handgun` and `long_gun` columns?

Let's find a row containing the median handgun value 3280:

In [None]:
guns.loc[guns["handgun"] == 3280]

Now find the rows containing the maximum handgun and long_gun values.
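Two patterns that may help here, sketched on a made-up dataframe: compare a column against its own maximum, or use `idxmax()` to get the index label of the largest value and pass it to `loc`. The names and numbers below are invented for illustration.

```python
import pandas as pd

# Made-up numbers, not real NICS data.
toy = pd.DataFrame({
    "state": ["Ohio", "Utah", "Iowa"],
    "handgun": [10, 1, 4],
    "long_gun": [3, 8, 5],
})

# Compare the column against its own maximum to keep every matching row.
top_handgun = toy.loc[toy["handgun"] == toy["handgun"].max()]

# idxmax gives the index label of the first maximum; loc then returns that row.
top_long_gun = toy.loc[toy["long_gun"].idxmax()]

print(top_handgun)
print(top_long_gun)
```

The first pattern returns a dataframe and keeps ties; the second returns a single row as a Series.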

### Challenges

- Make a hexagonal plot of just the Texas `handgun` vs. `long_gun` background check numbers.
- Choose another Seaborn plot from the [gallery](https://seaborn.pydata.org/examples/index.html).  Can you make it using the background check data?